Spark Lab Assignment

General instructions

The purpose of this assignment is to develop some basic Spark skills. The assignment is composed of two mandatory tasks and one challenge. Only PhD students are required to complete the challenge part; however, completing the challenge will also improve the final grade for Master's students.

Warning: all of the tasks must be solved using the Spark RDD API, in order to distribute the computations across the Spark workers.

How to get help for this assignment

GitHub is a web-based software development collaboration tool. It is very important for your career that you become familiar with it. This is why Q&A for this assignment will be based on GitHub issues. If you encounter any problem setting up the environment, or you need some additional pointers to solve the tasks, please open an issue here: https://github.com/SNICScienceCloud/LDSA-Spark/issues.

Getting started

Set up the environment

Before you begin the assignment, you need to set up the Docker-based Spark deployment that we saw in the lecture. You can install the environment either on the SNIC cloud or on your local computer. For Linux and Mac users, installing the environment locally should be straightforward, while on Windows machines the installation can be slightly more challenging. If you are unsure how to proceed, the best option is to deploy the environment on SNIC.

Set up the environment on the SNIC cloud (recommended)

  1. Log into https://hpc2n.cloud.snic.se
  2. Start an instance from TheSparkBox image:

    • Type an instance name
    • Use the scc.large flavor
    • Select the internal IPv4 network
    • Add the Spark security group
    • Select your Key Pair
    • Launch the instance
  3. Associate a floating IP to the newly created instance, and SSH into it:

    ssh core@<your-floating-ip>
  4. Deploy Spark and Jupyter by running:

    export SPARK_PUBLIC_DNS="<your-floating-ip>"
    export TSB_JUPYTER_TOKEN="<your-password>"
    tsb up -d

If everything went well, within a couple of minutes you should be able to access the web UIs.

  • Spark UI: http://<your-floating-ip>:8080
  • Jupyter: http://<your-floating-ip>:8888

Set up the environment locally

If you prefer to work on your local computer, and you can install Docker, you can find detailed instructions for getting the environment on your machine here: https://github.com/mcapuccini/TheSparkBox. The procedure should be straightforward for Linux and Mac; however, on Windows it could be a little more challenging.

Import the course material

The material for the Spark module in this course is stored in this GitHub repository: https://github.com/SNICScienceCloud/LDSA-Spark. You can import it into your deployment by following these steps:

  1. Log into Jupyter
  2. Start a new Jupyter terminal: New > Terminal
  3. Clone the course material running:

    git clone https://github.com/SNICScienceCloud/LDSA-Spark.git

If everything went well, you should be able to open this assignment as a Jupyter notebook. Please go back to the Jupyter home page and navigate to LDSA-Spark > Lab Assignment.ipynb. The LDSA-Spark folder also includes the Lecture Examples.ipynb notebook that we have seen in the lecture, so you can play with it. However, keep in mind that only one notebook at a time will manage to send tasks to the Spark Master, so make sure to shut down notebooks when you are not using them.

How to submit the assignment

Please complete the tasks in this assignment by editing this notebook, both for the code implementations and the theory questions. If you are not familiar with Jupyter, you may want to take a look at the Notebook Basics tutorial. When you are done with the tasks, please save this notebook with your solutions, download it as .ipynb (File > Download as > Notebook), and upload it to the student portal.

Task 1: DNA G+C percentage

DNA is a molecule that carries genetic information used in the growth, development, functioning and reproduction of all known living organisms (and some viruses). The DNA information is coded in a language of 4 bases: cytosine (C), guanine (G), adenine (A), thymine (T). The percentage of G+C bases in a DNA sequence has valuable biological meaning (wikipedia: https://goo.gl/kCLvDp), hence it is important to be able to compute it for long DNA sequences.

Task

Given an input DNA sequence, represented as a text file (data/dna.txt), compute the percentage of g + c occurrences in it. An example follows:

Input file:

atcg
ccgg
ttat

result: $$\frac{C_{count} + G_{count}}{C_{count} + G_{count} + A_{count} + T_{count}} = \frac{6}{12} = 0.5$$

Tip 1: when you load an input file as an RDD, each line will be loaded into a distinct string RDD record. In Scala you can count the occurrences of a certain character in a string as follows:


In [3]:
"atttccgg".count(c => c == 'g')


Out[3]:
2
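
For instance, a minimal sketch (not a complete solution) of the first step could look like the following, assuming the data/dna.txt path given in the task; the variable names are illustrative. Aggregating such per-line counts is what Tip 2 below is about.

// Load the input file: each line becomes one string record of the RDD,
// then map every line to its number of 'g' occurrences.
val linesRDD = sc.textFile("data/dna.txt")
val gCountsRDD = linesRDD.map(line => line.count(c => c == 'g'))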

Question 1: Is the previous operation parallel, or is it computed locally? Why?

Answer goes here

Tip 2: sums from different RDD records can be aggregated using the RDD reduce method. An example follows:


In [4]:
val sumsRDD = sc.parallelize(Array(3,5,2))
sumsRDD.reduce(_+_)


Out[4]:
10

Your solution


In [ ]:
{
// Implementation goes here
}

Question 2: What is an RDD in Spark?

Answer goes here

Question 3: Is reduce a transformation or an action? What are RDD transformations and RDD actions? How do they differ from each other?

Answer goes here

Question 4: What are some of the advantages of Spark over Hadoop (and MapReduce)?

Answer goes here

Task 2: Monte Carlo integration

Large dataset analysis is the main use case of Spark. However, Spark can be used to perform compute-intensive tasks as well. Numerical integration is a good example of a problem that falls into this group of use cases.

The Monte Carlo integration method is a way to get an approximation of the definite integral of a function $f(x)$ over an interval $[A,B]$. Given a value $Max_{f(x)}$, which $f(x)$ never exceeds, we first randomly draw $N$ uniformly distributed points $(x_1,y_1) … (x_N,y_N)$ s.t. $x_1 … x_N \in [A,B], y_1 … y_N \in [0,Max_{f(x)}]$. Then, assuming that $f(x)$ is positive over $[A,B]$, the fraction of points that fall under $f(x)$ will be roughly equal to the area under the curve divided by the total area of the rectangle in which we randomly drew points. Hence, the definite integral of $f(x)$ over $[A,B]$ is roughly equal to:

$$(B-A) Max_{f(x)}\frac{n_{P}}{tot_{P}},$$

where $n_{P}$ is the number of points that fell under $f(x)$, and $tot_{P}$ is the total number of randomly drawn points.
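
As an illustration only, a minimal sketch of this estimator with the RDD API could look like the following; the monteCarloIntegral helper, its parameters, and the use of scala.util.Random are assumptions for the sketch, not part of the assignment.

import scala.util.Random

// A minimal Monte Carlo sketch, assuming a positive function f bounded by
// fMax over [a, b]; n is the number of random points to draw.
def monteCarloIntegral(f: Double => Double, a: Double, b: Double,
                       fMax: Double, n: Int): Double = {
  val hits = sc.parallelize(1 to n).map { _ =>
    val x = a + (b - a) * Random.nextDouble() // uniform x in [a, b]
    val y = fMax * Random.nextDouble()        // uniform y in [0, fMax]
    if (y <= f(x)) 1 else 0                   // 1 if the point fell under f(x)
  }.reduce(_ + _)
  (b - a) * fMax * hits.toDouble / n
}

For example, monteCarloIntegral(x => x * x, 0, 1, 1, 1000) should return a value close to $1/3$, the exact integral of $x^2$ over $[0,1]$.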

Task

Write a program in Spark to approximate the definite integral of $f(x) = (1 + \sin(x)) / \cos(x)$ over $[0,1]$. This function is positive and never exceeds $4$ over $[0,1]$.

For the purpose of this assignment drawing 1000 points is good enough.

Question: What does Spark's parallelize function do? What is it good for?

Answer goes here

Your solution


In [ ]:
{
// Implementation goes here
}

Challenge: Iris flower classification

The well-known Iris flower dataset (wiki: https://goo.gl/OQjope) contains measurements for 150 Iris flowers from 3 different species: Iris setosa, Iris virginica and Iris versicolor. For each example in the dataset, 4 measurements were taken: sepal length, sepal width, petal length and petal width.

You are given a copy of this dataset in Comma Separated Values (CSV) format: data/iris_csv.txt. In the CSV text file, each line contains sepal length, sepal width, petal length, petal width and species name, separated by commas. The file looks something like the following example:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
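
If you need a starting point for the parsing, a minimal sketch, assuming the layout shown above (four numeric features followed by the species label; variable names are illustrative), could look like this:

// Parse each CSV line into (features, label), where features is an
// Array[Double] of the four measurements and label is the species name.
val irisRDD = sc.textFile("data/iris_csv.txt")
  .map(_.split(","))
  .map(fields => (fields.take(4).map(_.toDouble), fields(4)))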

Task 1: Build and evaluate a 3NN classifier for the Iris flower dataset in Spark. In order to evaluate your classifier, you can save 20% of the data for testing like we did in the lecture examples. For simplicity, you are allowed to collect the test data.

N.B. Part of the challenge is to figure out how to parse the input dataset into an RDD. Google is your friend!

Your solution


In [ ]:
{
// Implementation goes here
}